City College of San Francisco


MATH 108 - Foundations of Data Science

Lecture 02: Cause and Effect¶

Associated Textbook Sections: 2.0, 2.1, 2.2, 2.3, 2.4, 2.5

Overview¶

  • Associations
  • A Data Science Origin Story
  • Causation
  • Confounding Variables

Associations¶

Regularly Eating Chocolate Is Linked to 8 Percent Lower Heart Attack Risk¶

Image and Headline Source: everydayhealth.com


Study Source: European Journal of Preventive Cardiology

Study Observations¶

  • Individuals (study subjects, participants, units, etc.)
    • 336,289 US, Swedish, and Australian adults in several studies.
  • Treatment
    • Chocolate consumption
  • Outcome
    • Coronary heart disease risk

An Initial Question¶

Is there an association between chocolate consumption and heart disease risk?

An Answer¶

Yes. The reviewed article in the European Journal of Preventive Cardiology concludes that those who consumed chocolate more than once per week (or more than 3.5 times per month) had fewer cases of heart disease than those who didn't.

A Follow Up Question¶

Does chocolate consumption lead to a reduction in heart disease? This question is often harder to answer.

Causality

An Answer¶

Not necessarily. Several factors could explain why fewer people who consumed chocolate regularly developed heart disease. For example, greater financial security could explain both the freedom to consume more foods like chocolate and better access to health care, and therefore fewer cases of heart disease.

“Dr. Alice Lichtenstein, an American Heart Association volunteer and professor of nutrition science and policy at Tufts University, was more skeptical of the findings.”


A Data Science Origin Story¶

London, Early 1850’s¶

Image Source: Wikipedia - 1854 Broad Street Cholera Outbreak


Miasmas, Miasmatism, Miasmatists¶

  • Bad smells given off by waste and rotting matter
  • Believed to be the main source of disease
  • Staunch believers:
    • Florence Nightingale (founder of modern nursing)
    • Edwin Chadwick (Commissioner of the General Board of Health)

Suggested Remedies¶

Cholera, around 1850¶

  • “fly to clene air”
  • “a pocket full o’posies”
  • “fire off barrels of gunpowder”

This might seem strange ...

COVID-19, 2020¶

  • Inject disinfectant
  • Sunlight
  • Hydroxychloroquine
  • Take 6 deep breaths, then cough while covering mouth
  • Cannabis, cocaine, mangoes, onion, garlic, drinking water every 15 minutes, tea, eating ice cream, avoiding ice cream

John Snow, 1813-1858¶


Cholera Map¶

Image and Text Source: National Geographic - Mapping A London Epidemic

According to the National Geographic Society,

"This map of London was created by John Snow in 1854. London was experiencing a deadly cholera epidemic, when Snow tracked the cases on this map. The cholera cases are highlighted in black. Using this map, Snow and other scientists were able to trace the cholera outbreak to a single infected water pump."

In [1]:
from IPython.display import IFrame
# Adjacent string literals are joined by Python, so the URL contains no stray
# whitespace (a backslash continuation inside the string would keep the indentation).
IFrame(src=("https://www.google.com/maps/embed?pb=!1m18!1m12!1m3!1d2482.9971371478814!2d-"
            "0.13879218398430104!3d51.51326851809472!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13"
            ".1!3m3!1m2!1s0x487604d4eb49ec6d%3A0xc4ff84518f83499d!2sJohn%20Snow!5e0!3m2!1"
            "sen!2sus!4v1642117611191!5m2!1sen!2sus"),
       width=800, height=600)
Out[1]:

Causation¶

London Water Supply Service Regions¶

Image Source: British Library - John Snow's map showing the water supply in London, 1855

Image NOTE:

  • Blue - Southwark and Vauxhall Company
  • Red - Lambeth Company
  • Purple - The area in which the pipes of both Companies are intermingled.


Comparison¶

  • Treatment group
  • Control group
    • Does not receive the treatment

Snow’s “Grand Experiment” ... Study¶

“… there is no difference whatever in the houses or the people receiving the supply of the two Water Companies, or in any of the physical conditions with which they are surrounded …”

The two groups were similar except for the treatment.

Snow's Table¶

Python Imports and Settings¶

In [2]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
%matplotlib inline
In [3]:
snows_table = Table(['Supply Area', 'Number of Houses', 'Cholera Deaths']).with_rows([
    ['S&V', 40046, 1263], 
    ['Lambeth', 26107, 98],
    ['Rest of London', 256423, 1422]
])
snows_table
Out[3]:
Supply Area    | Number of Houses | Cholera Deaths
S&V            | 40046            | 1263
Lambeth        | 26107            | 98
Rest of London | 256423           | 1422

To compare the death totals across the supply areas, calculate the relative frequency of deaths per household.

In [4]:
death_per_house = snows_table.column('Cholera Deaths') / snows_table.column('Number of Houses')
snows_table.with_column('Deaths per House', 
                        death_per_house)
Out[4]:
Supply Area    | Number of Houses | Cholera Deaths | Deaths per House
S&V            | 40046            | 1263           | 0.0315387
Lambeth        | 26107            | 98             | 0.00375378
Rest of London | 256423           | 1422           | 0.00554552

Scale and round the rates to show whole numbers.

In [5]:
deaths_per_10000_houses = snows_table.column('Cholera Deaths') / snows_table.column('Number of Houses') * 10000
snows_table.with_column('Deaths per 10,000 Houses', 
                        np.round(deaths_per_10000_houses))
Out[5]:
Supply Area    | Number of Houses | Cholera Deaths | Deaths per 10,000 Houses
S&V            | 40046            | 1263           | 315
Lambeth        | 26107            | 98             | 38
Rest of London | 256423           | 1422           | 55

Scaling rates is a common presentation technique. This can provide clarity, but it can also be misleading!

Image Source: CDC - Rates of COVID-19-Associated Hospitalization (Updated Jan 8, 2022)


A Key to Establishing Causality¶

If the treatment and control groups are similar apart from the treatment, then differences between the outcomes in the two groups can be ascribed to the treatment.
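
As a rough sketch of that comparison, the snippet below treats the houses supplied by S&V as the treatment group and the houses supplied by Lambeth as the control group, using the counts from Snow's table. It is written in plain Python so that it runs without the datascience package, and the names `houses`, `deaths`, `rates`, `difference`, and `ratio` are illustrative assumptions rather than part of the original notebook.

```python
# Counts copied from Snow's table earlier in the lecture.
houses = {'S&V': 40046, 'Lambeth': 26107}
deaths = {'S&V': 1263, 'Lambeth': 98}

# Deaths per 10,000 houses in each supply area.
rates = {area: deaths[area] / houses[area] * 10000 for area in houses}

# Two common ways to compare the outcomes in the two groups.
difference = rates['S&V'] - rates['Lambeth']   # extra deaths per 10,000 houses
ratio = rates['S&V'] / rates['Lambeth']        # how many times higher the S&V rate is

print('Difference per 10,000 houses:', round(difference))
print('Ratio of rates:', round(ratio, 1))
```

With these counts, the S&V rate comes out roughly 8 times the Lambeth rate. Whether that gap can be ascribed to the water supply depends on Snow's argument that the two groups were otherwise similar.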


Confounding Variables¶

Confounding Factors Weaken a Causal Argument¶

  • If the treatment and control groups have systematic differences other than the treatment, then it might be difficult to identify causality.

  • Such differences are often present in observational studies.

  • When they lead researchers astray, they are called confounding factors.
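
A small simulation can suggest how a confounding factor produces an association without any causal link. The sketch below revisits the chocolate example with a hypothetical confounder, `wealthy`, that affects both chocolate consumption and heart disease, while chocolate itself has no effect on disease. Every probability in it is invented for illustration and is not taken from the chocolate study or any other source.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

# Made-up probabilities for illustration only (not from any study).
P_WEALTHY = 0.5
P_CHOCOLATE = {True: 0.7, False: 0.3}   # chance of eating chocolate, by wealth
P_DISEASE = {True: 0.05, False: 0.15}   # chance of heart disease, by wealth

people = []
for _ in range(100_000):
    wealthy = random.random() < P_WEALTHY
    chocolate = random.random() < P_CHOCOLATE[wealthy]
    # Disease depends only on wealth, never on chocolate.
    disease = random.random() < P_DISEASE[wealthy]
    people.append((wealthy, chocolate, disease))

def disease_rate(rows):
    """Proportion of the given people who developed heart disease."""
    return sum(disease for _, _, disease in rows) / len(rows)

def by_chocolate(rows, eats):
    """People in rows whose chocolate habit matches eats."""
    return [row for row in rows if row[1] == eats]

# Overall comparison: chocolate eaters appear to have less heart disease.
overall_gap = (disease_rate(by_chocolate(people, False))
               - disease_rate(by_chocolate(people, True)))
print('Overall gap in disease rate:', round(overall_gap, 3))

# Comparison within each wealth level: the gap should be close to zero.
within_gaps = {}
for level in (True, False):
    group = [row for row in people if row[0] == level]
    within_gaps[level] = (disease_rate(by_chocolate(group, False))
                          - disease_rate(by_chocolate(group, True)))
    print('Gap when wealthy =', level, ':', round(within_gaps[level], 3))
```

In this made-up setting, the overall gap comes entirely from the fact that wealthy people are more common among chocolate eaters. Comparing people at the same wealth level removes most of it.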

Example of a Confounding Relationship¶


Randomize! to Strengthen a Causal Argument¶

  • If you assign individuals to treatment and control at random, then the two groups are likely to be similar apart from the treatment.
  • You can (mathematically) account for variability in the assignment.
  • Randomized Controlled Experiment:
    • Randomly assign individuals to treatments
    • Ensure one treatment is a control where the outcome is understood.
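
Random assignment can be sketched in a few lines. The example below uses Python's standard `random` module to split 20 hypothetical participant IDs into treatment and control groups of equal size. The IDs, the group sizes, and the seed are assumptions chosen only for illustration.

```python
import random

random.seed(108)  # fixed seed so rerunning gives the same assignment

individuals = list(range(1, 21))  # hypothetical participant IDs 1 through 20

# Sampling every individual without replacement gives a random ordering.
shuffled = random.sample(individuals, k=len(individuals))

# The first half of the random ordering is treated; the rest are controls.
half = len(shuffled) // 2
treatment = sorted(shuffled[:half])
control = sorted(shuffled[half:])

print('Treatment:', treatment)
print('Control:  ', control)
```

Because a chance mechanism decides who lands in each group, no characteristic of the participants can systematically influence the assignment.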

Be Careful ...¶

Regardless of what the dictionary says, in probability theory

Random ≠ Haphazard

Adapted from UC Berkeley DATA 8 course materials.

This content is offered under a CC Attribution Non-Commercial Share Alike license.